---
name: matchms
description: Process, clean, and compare mass spectrometry (MS/MS) spectra with Matchms; use when you need reproducible spectral filtering and similarity scoring for metabolomics workflows.
license: MIT
author: aipoch
---
> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)
# Matchms Skill

## When to Use

- Use this skill when you need to process, clean, and compare mass spectrometry (MS/MS) spectra with matchms, with reproducible spectral filtering and similarity scoring for metabolomics workflows.
- Use this skill when a data analytics task needs a packaged method instead of ad-hoc freeform output.
- Use this skill when the user expects a concrete deliverable, validation step, or file-based result.
- Use this skill when `scripts/similarity_pipeline.py` is the most direct path to complete the request.
- Use this skill when you need the `matchms` package behavior rather than a generic answer.

## Key Features

- Workflow scoped to processing, cleaning, and comparing MS/MS spectra with matchms, including reproducible spectral filtering and similarity scoring for metabolomics.
- Packaged executable path(s): `scripts/similarity_pipeline.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `Third-party packages`: not version-pinned in this skill package. Pin versions if the skill needs stricter environment control.

## Example Usage

```bash
cd "20260316/scientific-skills/Data Analytics/matchms"
python -m py_compile scripts/similarity_pipeline.py
python scripts/similarity_pipeline.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/similarity_pipeline.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/similarity_pipeline.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## 1. When to Use

Use this skill when you need to:

- Import and harmonize MS/MS spectra from common community formats (e.g., MGF/MSP) before analysis.
- Clean spectra (peak filtering, intensity normalization) to improve downstream similarity scoring and identification.
- Compute spectral similarity (Cosine/Modified Cosine/Fingerprint-based) for library matching or clustering.
- Build reproducible, configurable processing pipelines for metabolomics projects.
- Compare many spectra efficiently (all-vs-all or query-vs-library) and store/inspect score outputs.

## 2. Key Features

- **Import/Export support**: Read spectra from mzML, mzXML, MGF, MSP, and JSON (depending on installed readers), and write processed spectra back out (e.g., as MGF, MSP, or JSON).
- **Filtering & harmonization**: Metadata standardization, peak cleaning, intensity normalization, and other reusable filters.
- **Similarity scoring**:
  - Cosine similarity (Greedy/Hungarian variants)
  - Modified Cosine (accounts for precursor mass shifts)
  - Fingerprint-based similarities (when molecular fingerprints are available)
- **Pipeline composition**: Chain filters and scoring steps into repeatable workflows.

Additional reference material (if present in the repository):
- Filters: `references/filtering.md`
- Similarity: `references/similarity.md`
- Workflows: `references/workflows.md`

## 3. Dependencies

- `matchms` (version depends on your environment; pin in your project, e.g., `matchms>=0.20,<1.0`)
- `numpy` (e.g., `numpy>=1.20`)
- `scipy` (e.g., `scipy>=1.7`)
- `rdkit` (optional; required for chemistry/fingerprint-related functionality, version varies by distribution)

## 4. Example Usage

A minimal example that loads spectra from an MGF file (`data.mgf` is a placeholder path) and computes all-vs-all cosine scores:

```python
from matchms.importing import load_from_mgf
from matchms import calculate_scores
from matchms.similarity import CosineGreedy

def main():
    # Load spectra from an MGF file
    spectra = list(load_from_mgf("data.mgf"))

    # Compute similarity scores (all-vs-all)
    scores = calculate_scores(
        references=spectra,
        queries=spectra,
        similarity_function=CosineGreedy()
    )

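    # Iterating over a Scores object yields (reference, query, score) tuples,
    # where reference and query are Spectrum objects. For CosineGreedy, score
    # holds the cosine value and the number of matched peaks; the field names
    # differ between matchms versions, so positional access is used here.
    for reference, query, score in scores:
        print(
            f"ref={reference.get('compound_name')} "
            f"query={query.get('compound_name')} "
            f"cosine={float(score[0]):.4f} matches={int(score[1])}"
        )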

if __name__ == "__main__":
    main()
```
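
The example above uses `CosineGreedy`. To show how greedy matching can differ from Hungarian (optimal assignment) matching, here is a small sketch with NumPy and SciPy. It is not the matchms implementation: the helper names (`pair_weights`, `greedy_total`, `hungarian_total`, `cosine`) and the toy peaks are hypothetical, and pair weights are simplified to plain intensity products.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def pair_weights(mz_a, int_a, mz_b, int_b, tolerance=0.1):
    """Weight matrix: intensity product for peak pairs within tolerance, else 0."""
    mz_a, int_a = np.asarray(mz_a, dtype=float), np.asarray(int_a, dtype=float)
    mz_b, int_b = np.asarray(mz_b, dtype=float), np.asarray(int_b, dtype=float)
    within = np.abs(mz_a[:, None] - mz_b[None, :]) <= tolerance
    return np.where(within, np.outer(int_a, int_b), 0.0)


def greedy_total(weights):
    """Repeatedly take the heaviest remaining pair whose peaks are both unused."""
    used_a, used_b, total = set(), set(), 0.0
    for flat_index in np.argsort(weights, axis=None)[::-1]:
        i, j = (int(k) for k in np.unravel_index(flat_index, weights.shape))
        if weights[i, j] <= 0.0:
            break  # only non-matching pairs remain
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += float(weights[i, j])
    return total


def hungarian_total(weights):
    """Solve the one-to-one assignment that maximizes the summed weight."""
    rows, cols = linear_sum_assignment(weights, maximize=True)
    return float(weights[rows, cols].sum())


def cosine(total, int_a, int_b):
    """Normalize the summed pair weights by the two intensity vector norms."""
    return total / (np.linalg.norm(int_a) * np.linalg.norm(int_b))


# Toy spectra: peak A0 can pair with B0 or B1, peak A1 only with B1.
mz_a, int_a = [100.00, 100.12], [1.0, 0.9]
mz_b, int_b = [99.95, 100.05], [0.9, 1.0]
weights = pair_weights(mz_a, int_a, mz_b, int_b)
greedy_score = cosine(greedy_total(weights), int_a, int_b)
hungarian_score = cosine(hungarian_total(weights), int_a, int_b)
print(f"greedy={greedy_score:.4f} hungarian={hungarian_score:.4f}")
```

In this toy case the greedy pass takes the single heaviest pair (A0, B1) first, which blocks A1 from matching at all, so it scores about 0.55. The optimal assignment pairs A0 with B0 and A1 with B1 and scores about 0.99. On real spectra the two usually agree much more closely, and greedy matching is cheaper.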

## 5. Implementation Details

- **Data model**: Matchms operates on `Spectrum` objects containing peak m/z and intensity arrays plus metadata (e.g., precursor m/z, charge, compound name/identifier).
- **Filtering stage**: Typical pipelines apply filters to:
  - standardize/repair metadata fields,
  - remove noise peaks (e.g., by intensity threshold or m/z window rules),
  - normalize intensities (commonly to a maximum of 1.0 or to unit norm).
  See `references/filtering.md` for filter patterns and recommended sequences.
- **Cosine similarity (Greedy/Hungarian)**:
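  - Peaks are matched within an m/z tolerance set through the similarity class parameters (e.g., `CosineGreedy(tolerance=0.1)`); check the defaults of your installed version rather than assuming them.
  - **Greedy** matching picks the best available peak pair at each step. It is fast but not guaranteed to find the optimal overall assignment.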
  - **Hungarian** matching solves an assignment problem to maximize total match score under one-to-one constraints.
- **Modified Cosine**:
  - Extends cosine matching by letting peaks align either directly or after shifting by the precursor m/z difference between the two spectra, which improves matching for related compounds/adducts.
  - Typically requires precursor m/z metadata to be present and consistent.
- **Fingerprint similarity (optional)**:
  - Requires molecular fingerprints (often derived via RDKit) and compares spectra/compounds using fingerprint similarity metrics.
  - Use when you have structure annotations or can compute fingerprints reliably.
- **Workflow reproducibility**:
  - Prefer explicit, ordered filter chains and pinned dependency versions.
  - Store configuration (tolerances, normalization choices, filters used) alongside results for traceability.
  See `references/workflows.md` for pipeline organization guidance.
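
The precursor mass shift described under **Modified Cosine** can be sketched in plain Python. This is a simplified illustration, not the matchms code: `matching_pairs` and `modified_cosine_candidates` are hypothetical helpers, and the spectra are made up so that two fragments carry a 14.0 Da modification.

```python
def matching_pairs(mz_a, mz_b, tolerance=0.1, shift=0.0):
    """Return (i, j) index pairs where mz_a[i] is within tolerance of mz_b[j] + shift."""
    return [
        (i, j)
        for i, a in enumerate(mz_a)
        for j, b in enumerate(mz_b)
        if abs(a - (b + shift)) <= tolerance
    ]


def modified_cosine_candidates(mz_a, precursor_a, mz_b, precursor_b, tolerance=0.1):
    """Candidate pairs: direct matches plus matches after the precursor shift."""
    shift = precursor_a - precursor_b
    direct = matching_pairs(mz_a, mz_b, tolerance)
    shifted = matching_pairs(mz_a, mz_b, tolerance, shift=shift)
    return sorted(set(direct) | set(shifted))


# Hypothetical spectra: B is A plus a 14.0 Da modification on two fragments.
mz_a, precursor_a = [100.0, 150.0, 250.0], 300.0
mz_b, precursor_b = [100.0, 164.0, 264.0], 314.0

direct_only = matching_pairs(mz_a, mz_b)
with_shift = modified_cosine_candidates(mz_a, precursor_a, mz_b, precursor_b)
print(direct_only)  # [(0, 0)]
print(with_shift)   # [(0, 0), (1, 1), (2, 2)]
```

Only one peak pair matches directly, while the shifted comparison recovers the two modified fragments. A full score would then pass these candidate pairs to a one-to-one assignment step and normalize the result as in ordinary cosine scoring.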
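
For the **Fingerprint similarity** item, one common metric is the Jaccard (Tanimoto) index over the set bits of two fingerprints. The sketch below assumes fingerprints are already available as collections of on-bit indices; generating them (for example with RDKit) is out of scope here, and `jaccard_similarity` is a hypothetical helper rather than a matchms function.

```python
def jaccard_similarity(bits_a, bits_b):
    """Jaccard (Tanimoto) similarity between two collections of on-bit indices."""
    set_a, set_b = set(bits_a), set(bits_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(set_a & set_b) / len(union)


# Hypothetical fingerprints sharing two of six distinct on-bits.
fingerprint_a = {1, 4, 7, 9}
fingerprint_b = {4, 7, 10, 12}
similarity = jaccard_similarity(fingerprint_a, fingerprint_b)
print(f"{similarity:.3f}")  # 0.333
```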
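
For the **Workflow reproducibility** item, one way to store configuration alongside results is to write a small JSON record with the settings and the installed package versions. This is a minimal sketch using only the standard library; the function names, the config keys, and the `run_config.json` file name are assumptions for illustration, not part of matchms or this skill's script.

```python
import json
import platform
import tempfile
from importlib import metadata
from pathlib import Path


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def write_run_record(path, config, packages=("matchms", "numpy", "scipy")):
    """Write the configuration plus environment details next to the results."""
    record = {
        "config": config,
        "python": platform.python_version(),
        "packages": {name: installed_version(name) for name in packages},
    }
    Path(path).write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return record


# Hypothetical settings; the keys are illustrative, not a matchms schema.
config = {
    "similarity": "CosineGreedy",
    "tolerance": 0.1,
    "filters": ["default_filters", "normalize_intensities"],
}

# A temporary directory stands in for the real results folder.
with tempfile.TemporaryDirectory() as out_dir:
    record_path = Path(out_dir) / "run_config.json"
    record = write_run_record(record_path, config)
    loaded = json.loads(record_path.read_text(encoding="utf-8"))
print(loaded["config"])
```

Writing the record with sorted keys keeps diffs between runs small, which makes it easier to spot a changed tolerance or filter list.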